STOMP: confirm utf-8 handling (backport #13858) #13860

Merged
merged 1 commit into v4.1.x from mergify/bp/v4.1.x/pr-13858
May 6, 2025

@mergify mergify bot commented May 6, 2025

This is an interim conclusion confirming that our STOMP implementation can handle multi-byte characters in UTF-8 encoding.

The question came up during Native STOMP review.

The frame parser collects bytes one by one into a list and, before transitioning to the next state, reverses that accumulator list. A multi-byte character is therefore represented there by the corresponding number of integers, each in the byte range 0..255. In tests and in our code we work with headers via Erlang string literals, which (at least with the default source file encoding) accept Unicode just fine and use UTF-8 as the encoding. The tricky part is that string literals are encoded as lists of integers (one per code point), not as lists of bytes:

"headꙕr1" becomes [104,101,97,100,42581,114,49].

Binary literals without encoding:

binary_to_list(<<"headꙕr1">>).
[104,101,97,100,85,114,49] %% nonsense from a string perspective: 42581 is truncated to 8 bits (85)

and with:

binary_to_list(<<"headꙕr1"/utf8>>).
[104,101,97,100,234,153,149,114,49]

This last list is exactly the list we get in the frame parser.
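The three representations above can also be reproduced outside Erlang. The following is a minimal Python sketch, not part of this PR; the variable names are illustrative, and the non-ASCII character is written as `chr(42581)` to match the code point shown above.

```python
# Illustrative sketch (not from the PR): the same header name seen three ways.
header = "head" + chr(42581) + "r1"

# 1. Code points: what an Erlang string literal holds (one integer per character).
code_points = [ord(ch) for ch in header]

# 2. Each code point truncated to its low 8 bits: the effect of <<"...">> without /utf8.
truncated = [cp & 0xFF for cp in code_points]

# 3. UTF-8 bytes: the effect of <<"..."/utf8>>, and what the frame parser receives.
utf8_bytes = list(header.encode("utf-8"))

print(code_points)
print(truncated)
print(utf8_bytes)
```

Only the third list round-trips back to the original string when decoded as UTF-8.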

It was confusing at first, until I realized I was mostly fighting Erlang in the tests. The newly added Python test simply confirms that UTF-8 content is relayed just fine. As for the standard headers and our 'x-' extensions, they all fit into ASCII, so there is no problem when we look them up using stomp_frame:header.
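To make the byte-level view concrete, here is a hypothetical Python sketch of a parser that accumulates one header line byte by byte and reverses the accumulator at each state transition. It is neither the RabbitMQ frame parser nor the test added in this PR; `parse_header_line` and the sample header are invented for illustration.

```python
# Hypothetical sketch: accumulate bytes in reverse, then reverse on state change.
def parse_header_line(line: bytes):
    """Split one 'name:value' header line into two lists of byte values."""
    acc = []      # built by prepending, like a cons-built Erlang list
    name = None
    for byte in line:
        if byte == ord(":") and name is None:
            name = list(reversed(acc))   # leaving the "name" state
            acc = []
        else:
            acc.insert(0, byte)          # prepend the next byte
    if name is None:
        raise ValueError("header line has no ':' separator")
    return name, list(reversed(acc))


header_name = "head" + chr(42581) + "r1"
wire = (header_name + ":value1").encode("utf-8")

name, value = parse_header_line(wire)
print(name)                            # integers, each in 0..255
print(bytes(name).decode("utf-8"))     # decodes back to the original header name
```

The multi-byte character never needs special handling here: its bytes pass through as ordinary list elements and are only interpreted as UTF-8 when the list is decoded.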

Bottom line:

We relay UTF-8 just fine. As long as we keep the default encoding for our source files, the string literals in our code keep working.

PS.

Curiously, Erlang's list_to_binary doesn't work with strings that contain code points above 255 (the unicode module must be used instead):

list_to_binary("headꙕr1").
** exception error: bad argument
     in function  list_to_binary/1
        called as list_to_binary([104,101,97,100,42581,114,49])
        *** argument 1: not an iolist term

I don't know yet whether this matters for us outside STOMP, but wherever Unicode strings are involved, list_to_binary should be replaced with unicode:characters_to_binary:

unicode:characters_to_binary("headꙕr1").
<<104,101,97,100,234,153,149,114,49>>

However, all our protocol strings fit into the first 128 ASCII codes, so we are just fine.
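The same distinction can be seen in Python, offered here only as an analogy to the Erlang behaviour above: `bytes(list)` plays the role of list_to_binary and `str.encode("utf-8")` the role of unicode:characters_to_binary. The variable names and the ASCII sample header are illustrative.

```python
# Analogy only: a direct list-to-bytes conversion rejects integers above 255.
code_points = [104, 101, 97, 100, 42581, 114, 49]

try:
    bytes(code_points)      # comparable to list_to_binary/1
    direct_ok = True
except ValueError:
    direct_ok = False       # 42581 does not fit into a single byte

# Comparable to unicode:characters_to_binary/1: encode code points as UTF-8.
encoded = "".join(chr(cp) for cp in code_points).encode("utf-8")

# ASCII-only protocol strings survive the direct conversion unchanged.
ascii_header = [ord(ch) for ch in "content-length"]
ascii_bytes = bytes(ascii_header)

print(direct_ok, list(encoded), ascii_bytes)
```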


This is an automatic backport of pull request #13858 done by [Mergify](https://mergify.com).

(cherry picked from commit 0ec2599)
@mergify mergify bot assigned ikavgo May 6, 2025
@michaelklishin michaelklishin added this to the 4.1.1 milestone May 6, 2025
@michaelklishin michaelklishin merged commit 0aeca40 into v4.1.x May 6, 2025
271 checks passed
@michaelklishin michaelklishin deleted the mergify/bp/v4.1.x/pr-13858 branch May 6, 2025 14:53
michaelklishin added a commit that referenced this pull request May 6, 2025
STOMP: confirm utf-8 handling (backport #13858)
(cherry picked from commit 0aeca40)